DSC365: Introduction to Data Science

Author

Spatial Data

Published

November 19, 2024

What is Spatial Data?

Spatial data relates to a geographical area or location:

  • Tobler’s First Law of Geography: things close in space are more related than things further in space.
  • Going to talk about how to manage and visualize spatial data in R. Modeling goes beyond the scope of this class.
#Necessary Packages
library(tidyverse)
library(RColorBrewer)
library(mosaic)
library(mdsr)
library(RColorBrewer)
library(sp)
library(sf)

Common Spatial Data Structure: Shapefiles

  • Spatial data consists of geometric items like points, lines, and polygons; not rows and columns
  • Contain “instructions” on drawing boundaries for countries, etc.

Package focus: sf

  • tidyverse friendly package that contains commonly used functions for spatial objects

Visualizing Spatial Data

Example: You know nothing, John Snow…

In the 1850s, there was a cholera outbreak in Victorian London. A physician, Dr. John Snow, took a unique approach to studying the outbreak: he mapped where the patients lived.

head(CholeraDeaths)
Simple feature collection with 6 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 529308.7 ymin: 181006 xmax: 529336.7 ymax: 181031.4
Projected CRS: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs
  Id Count                  geometry
1  0     3 POINT (529308.7 181031.4)
2  0     2 POINT (529312.2 181025.2)
3  0     1 POINT (529314.4 181020.3)
4  0     1 POINT (529317.4 181014.3)
5  0     4 POINT (529320.7 181007.9)
6  0     2   POINT (529336.7 181006)
coords <- do.call(rbind, st_geometry(CholeraDeaths)) %>% as.data.frame() %>% setNames(c("X","Y"))
coords$count = CholeraDeaths$Count

coords%>% ggplot(aes(x=X, y=Y)) + geom_point(aes(col = count))

Problem… this is not a map. It is hard to see where those points are in 1850s London.

Let’s add some context using the coordinates (units in degrees), but still lacking context

ggplot(CholeraDeaths) +
  geom_sf()

Add London Street Map:

library(ggspatial)
ggplot(CholeraDeaths) + 
  annotation_map_tile(type = "osm", zoomin = 0) + 
  geom_sf(aes(size = Count), alpha = 0.7)


  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |=========                                                             |  12%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |==========================                                            |  38%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |============================================                          |  62%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |=============================================================         |  88%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |======================================================================| 100%

Map Projections:

A note of map projections:

  • In this map both CholeraDeaths and the map tiles retrieved by the ggspatial package have geospatial coordinates, but those coordinates are not in the same unit
  • Earth happens to be an oblate spheroid—a three-dimensional flattened sphere. Yet we would like to create two-dimensional representations of the Earth that fit on pages or computer screens
    • Converting from a 3d coordinate system to a 2d is called a projection
  • How you project your data influences how your map is perceived.
  • Would need to update the coordinate reference system of ChloreaDeaths fo fix this problem
library(mapproj)
library(maps)


map("world", projection = "mercator", wrap = TRUE)

map("world", projection = "cylequalarea", param = 45, wrap = TRUE)

Let’s try to fix our map projections

sf::st_crs(CholeraDeaths)
Coordinate Reference System:
  User input: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs 
  wkt:
PROJCRS["unknown",
    BASEGEOGCRS["unknown",
        DATUM["Unknown based on Airy 1830 ellipsoid",
            ELLIPSOID["Airy 1830",6377563.396,299.3249646,
                LENGTHUNIT["metre",1,
                    ID["EPSG",9001]]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8901]]],
    CONVERSION["unknown",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",49,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",-2,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.9996012717,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",400000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",-100000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]
cholera_latlong <- CholeraDeaths %>%
  st_set_crs(27700) %>%
  st_transform(4326)


ggplot(cholera_latlong) + 
  annotation_map_tile(type = "osm", zoomin = 0) + 
  geom_sf(aes(size = Count))


  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |======================================================================| 100%

Other methods to visualizing Spatial Data

Google API key

#devtools::install_github("dkahle/ggmap")
library(ggmap)
m <- get_map('John Snow, London, England', maptype='roadmap')
ggmap(m)

In mid-2018, Google Maps began requiring users to have a registered API key.

  • Users have to set up an account with Google, enable the relevant APIs, and then tell R about the user’s set up.
  • Part of this process requires a credit card. It’s totally lame. Don’t do it.

Instructions to use Google API key (if you insist):

  • To obtain an API key and enable services, go to https://cloud.google.com/maps-platform/.
  • To tell ggmap about your API key, use register_google(), e.g. register_google(key = “mQkzTpiaLYjPqXQBotesgif3EfGL2dbrNVOrogg”) (that’s a fake key). This will set your API key for the current session, but if you restart R, you’ll need to do it again.

Free Alternative: tmap

library(tmap)

map <- qtm(CholeraDeaths, symbols.col = "Count", symbols.id="Count")
tmap_leaflet(map)

Another Free Alternative: leaflet

Leaflet is an open-source JavaScript library for interactive maps. This R package makes it easy to create Leaflet maps from R.

library(leaflet)
library(readr)

bigfoot_data <- read_csv("https://query.data.world/s/egnaxxvegdkzzrhfhdh4izb6etmlms")


bigfoot_data %>%
  filter(classification == "Class A") %>%
  mutate(seasoncolor = str_replace_all(season, c("Fall" = "orange",
                                                 "Winter" = "skyblue",
                                                 "Spring" = "green",
                                                 "Summer" = "yellow")),
         # This code just wraps the description to the width of the R terminal
         # and inserts HTML for a line break into the text at appropriate points
         desc_wrap = purrr::map(observed, ~strwrap(.) %>%
                                  paste(collapse = "<br/>") %>%
                                  htmltools::HTML())) %>%
leaflet() %>%
  addTiles() %>%
  addCircleMarkers(~longitude, ~latitude, color = ~seasoncolor, label = ~desc_wrap)

Another Example: Education levels in the US

The Education.csv data set contains information on educational attainment for each county in the United States.

First: We need Nebraska shapefiles to create the boundaries. We are going to use the tigris package, which includes shapefiles from the US Census Bureau

library(tigris)

US_counties <- counties(cb = TRUE)

US_counties <- US_counties %>% arrange(GEOID)
glimpse(US_counties)
Education <- read.csv("Education.csv")
US_counties$GEOID = as.numeric(US_counties$GEOID)
Education$FIPS.Code = as.numeric(Education$FIPS.Code)

Education_Counties <- US_counties %>% inner_join(Education, by=c('GEOID'='FIPS.Code'))
glimpse(Education_Counties)
Rows: 3,211
Columns: 27
$ STATEFP              <chr> "01", "01", "01", "01", "01", "01", "01", "01", "…
$ COUNTYFP             <chr> "001", "003", "005", "007", "009", "011", "013", …
$ COUNTYNS             <chr> "00161526", "00161527", "00161528", "00161529", "…
$ AFFGEOID             <chr> "0500000US01001", "0500000US01003", "0500000US010…
$ GEOID                <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, 1…
$ NAME                 <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount"…
$ NAMELSAD             <chr> "Autauga County", "Baldwin County", "Barbour Coun…
$ STUSPS               <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
$ STATE_NAME           <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ LSAD                 <chr> "06", "06", "06", "06", "06", "06", "06", "06", "…
$ ALAND                <dbl> 1539631461, 4117724893, 2292160151, 1612188713, 1…
$ AWATER               <dbl> 25677536, 1132887353, 50523213, 9572302, 14860281…
$ State                <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
$ Area.name            <chr> "Autauga County", "Baldwin County", "Barbour Coun…
$ X2003Rural.Urban     <int> 2, 4, 6, 1, 1, 6, 6, 3, 6, 8, 1, 9, 7, 9, 8, 6, 3…
$ X2003Urban.Influence <int> 2, 5, 6, 1, 1, 6, 6, 2, 5, 6, 1, 10, 11, 10, 4, 5…
$ X2013Rural.Urban     <int> 2, 3, 6, 1, 1, 6, 6, 3, 6, 6, 1, 9, 7, 9, 8, 4, 3…
$ X2013Urban.Influence <int> 2, 2, 6, 1, 1, 6, 6, 2, 5, 6, 1, 10, 11, 10, 4, 5…
$ LessHS2000           <dbl> 21.3, 18.0, 35.3, 36.8, 29.6, 39.5, 32.2, 26.1, 3…
$ HSonly2000           <dbl> 33.8, 29.6, 32.4, 35.7, 36.0, 35.2, 34.5, 32.2, 3…
$ SomeCollege2000      <dbl> 26.9, 29.3, 21.3, 20.4, 24.8, 17.5, 22.9, 26.4, 2…
$ CollegePlus2000      <dbl> 18.0, 23.1, 10.9, 7.1, 9.6, 7.7, 10.4, 15.2, 9.5,…
$ LessHS2013           <dbl> 12.3, 9.8, 26.9, 17.9, 20.2, 28.6, 18.9, 16.8, 19…
$ HSonly2013           <dbl> 33.6, 27.8, 35.5, 43.9, 32.3, 36.6, 40.4, 32.2, 3…
$ SomeCollege2013      <dbl> 29.1, 31.7, 25.5, 25.0, 34.4, 21.4, 24.5, 33.1, 2…
$ CollegePlus2013      <dbl> 25.0, 30.7, 12.0, 13.2, 13.1, 13.4, 16.1, 17.9, 1…
$ geometry             <MULTIPOLYGON [°]> MULTIPOLYGON (((-86.9212 32..., MULT…

Plot education levels by county in Nebraska: less than a high school degree.

NE_Education <- Education_Counties %>% filter(State=='NE')
favstats(~LessHS2013, data=NE_Education)
 min  Q1 median  Q3  max     mean       sd  n missing
 1.1 6.1    7.8 9.9 28.8 8.548387 4.329188 93       0

ggplot2

ggplot(NE_Education) + geom_sf(aes(fill = LessHS2013)) + scale_fill_distiller(palette='Blues', direction=+1)

ggplot(NE_Education) + geom_sf(aes(fill = LessHS2013)) + scale_fill_distiller(palette='Blues', direction=+1) + geom_sf_label(aes(label = NAME), fill = "white", size = 2)

leaflet

greens <- colorBin("Greens", domain = c(0, 30), bins = 6)

leaflet(NE_Education) %>% 
  addTiles() %>% 
  addPolygons(color = "black",
                opacity = 1,
                weight = 1,
                fillColor = ~greens(LessHS2013),
                fillOpacity = 0.7,
                label = ~paste0("County: ", Area.name, ' (', LessHS2013, '%)')) %>%
  addLegend(position = 'bottomleft',
            colors = greens(seq(from=2.5, to=27.5, length=6)),
            labels = c('0-5%', '5.1-10%', '10.1-15%', '15.1-20%', '20.1-25%', '25.1-30%'),
            title = 'Less than a HS Degree (2013)')

Try Yourself!

One variable we might be interested in is the margin: difference between percentage of those graduating with a High School only degree between 2013 and 2000. Try to make a new variable called Margin which represents the difference between those graduating from High School only, and then make a plot.

NE_Education <- NE_Education %>% mutate(MarginDiff = HSonly2013-HSonly2000)

ggplot(NE_Education) + geom_sf(aes(fill = MarginDiff)) + scale_fill_distiller(palette='RdBu', direction=-1)

palette <- colorBin("RdBu", domain = c(-70, 70), bins = 9, reverse = TRUE)

leaflet(NE_Education) %>% 
  addTiles() %>% 
  addPolygons(color = "black",
                opacity = 1,
                weight = 1,
                fillColor = ~palette(MarginDiff),
                fillOpacity = 0.7,
                label = ~paste0("Name: ", NAME))

Another Example: Election Results

Daily Kos compiled results by congressional district for the 2008, 2012, and 2016 presidential elections, this data is saved as CongressionalElections.csv. Let’s use this data to visualize election results in the US.

  • FiveThirtyEight identifies the states of Colorado, Florida, Iowa, Michigan, Minnesota, Ohio, Nevada, New Hampshire, North Carolina, Pennsylvania, Virginia, and Wisconsin as “perennial” swing states that have regularly seen close contests over the last few presidential campaigns.
CongressionalElections <- read.csv("CongressionalElections.csv")
glimpse(CongressionalElections)
Rows: 435
Columns: 9
$ CD          <chr> "AK-AL", "AL-01", "AL-02", "AL-03", "AL-04", "AL-05", "AL-…
$ Incumbent   <chr> "Young, Don", "Byrne, Bradley", "Roby, Martha", "Rogers, M…
$ Party       <chr> "(R)", "(R)", "(R)", "(R)", "(R)", "(R)", "(R)", "(D)", "(…
$ Clinton2016 <dbl> 37.6, 34.1, 33.0, 32.3, 17.4, 31.3, 26.1, 69.8, 30.2, 41.7…
$ Trump2016   <dbl> 52.8, 63.5, 64.9, 65.3, 80.4, 64.7, 70.8, 28.6, 65.0, 52.4…
$ Obama2012   <dbl> 41.2, 37.4, 36.4, 36.8, 24.0, 34.9, 24.7, 72.4, 36.3, 42.9…
$ Romney2012  <dbl> 55.3, 61.8, 62.9, 62.3, 74.8, 63.9, 74.3, 27.1, 61.0, 54.7…
$ Obama2008   <dbl> 38.1, 38.5, 35.0, 36.6, 25.5, 36.3, 25.0, 71.5, 39.2, 44.3…
$ McCain2008  <dbl> 59.7, 60.9, 64.5, 62.6, 73.3, 62.6, 74.1, 28.1, 58.0, 53.8…

Let’s look at the swing state of North Carolina

NC_congressional <- congressional_districts(cb = TRUE, state = "North Carolina", year = 2016) 
NC_congressional <- NC_congressional %>% arrange(CD115FP)

NC_data <- CongressionalElections %>% filter(str_detect(CD, 'NC'))

NC_congressional$Clinton2016 <- NC_data$Clinton2016
NC_congressional$Trump2016 <- NC_data$Trump2016

NC_congressional <- NC_congressional %>% mutate(Margin2016 = Trump2016-Clinton2016)

ggplot(NC_congressional) + geom_sf(aes(fill = Margin2016)) + scale_fill_distiller(palette='RdBu', direction=-1)

leaflet(NC_congressional) %>% 
  addTiles() %>% 
  addPolygons(color = "black",
                opacity = 1,
                weight = 1,
                fillColor = ~palette(Margin2016),
                fillOpacity = 0.7,
                label = ~paste0("District: ", CD115FP))